Abstract

The aim of this article is to propose a novel kernel estimator of the baseline function
in a general high-dimensional Cox model, for which we derive non-asymptotic
rates of convergence. To construct our estimator, we first estimate the regression
parameter in the Cox model via a Lasso procedure. We then plug this estimator
into the classical kernel estimator of the baseline function, obtained by smoothing
the so-called Breslow estimator of the cumulative baseline function. We propose
and study an adaptive procedure for selecting the bandwidth, in the spirit of
Goldenshluger and Lepski (2011). We state non-asymptotic oracle inequalities for the
final estimator, which reveal the reduction of the rates of convergence when the
dimension of the covariates grows.

Keywords: Cox’s proportional hazards model; Conditional hazard rate function;
Semi-parametric model; High-dimensional covariates; Counting processes; Kernel
estimation; Goldenshluger and Lepski method; Non-asymptotic oracle inequalities;
Survival analysis
1 Introduction


The Cox model, introduced by Cox (1972), is a regression model often considered in
survival analysis to relate the distribution of a time $T$ to the values of covariates. The
hazard function of $T$ is then defined by

$$\lambda_0(t, Z) = \alpha_0(t) \exp(\beta_0^T Z), \qquad (1)$$

where $Z = (Z_1, \dots, Z_p)^T$ is a $p$-dimensional vector of covariates, $\beta_0 = (\beta_{01}, \dots, \beta_{0p})^T$ the
vector of regression coefficients and $\alpha_0$ the baseline hazard function.

The regression parameter $\beta_0$ and the baseline function $\alpha_0$ are the two unknown
parameters in this model. Yet, more attention has been paid to the estimation of the regression
parameter than to the estimation of the baseline function.

There are good reasons for this. First, the Cox partial log-likelihood, introduced by
Cox (1972), allows one to estimate $\beta_0$ without knowledge of $\alpha_0$. Secondly, the regression
parameter is directly related to the covariates. Therefore, in order to select the relevant
covariates that best explain the survival time, we need to estimate the regression
parameter. Many papers deal with the estimation of $\beta_0$, whether the number of
covariates $p$ is large or not compared with the size of the panel $n$. When $p$ is smaller
than $n$, the usual estimator of $\beta_0$ is obtained by maximizing the Cox partial log-likelihood
(see Andersen et al. (1993) as a reference book). When the number of covariates grows,
the Lasso procedure is often considered. This procedure consists in the minimization of
the opposite of the $\ell_1$-penalized Cox partial log-likelihood. Asymptotic results are stated
in Bradic et al. (2012), Kong and Nan (2012), Bradic and Song (2012). Finally, the
non-asymptotic rate of convergence of the Lasso is now known to be of order $\sqrt{\log p / n}$, see
Huang et al. (2013).

The estimation of the baseline function $\alpha_0$ has been less studied. The known estimator
of the baseline function is a kernel estimator, introduced by Ramlau-Hansen (1983a;b).
We present here its form in the special case of right-censoring. Let us consider, for
the moment, that we observe for $i = 1, \dots, n$, $(X_i, \delta_i, Z_i)$, where $X_i = \min(T_i, C_i)$,
$\delta_i = 1_{\{T_i \le C_i\}}$, $T_i$ is the time of interest and $C_i$ the censoring time. The usual kernel estimator
is then obtained from an estimator of the cumulative baseline function $\Lambda_0$ defined by
$\Lambda_0(t) = \int_0^t \alpha_0(s)\,ds$. This estimator is called the Breslow estimator and is defined, for
$t > 0$, by

$$\hat\Lambda_0(t, \beta) = \sum_{i=1}^{n} \frac{\delta_i 1_{\{X_i \le t\}}}{S_n(X_i, \beta)} \quad \text{with} \quad S_n(t, \beta) = \sum_{i : X_i \ge t} \exp(\beta^T Z_i), \qquad (2)$$

see Ramlau-Hansen (1983b) and Andersen et al. (1993) for details. From $\hat\Lambda_0(\cdot, \beta)$, the
kernel function estimator for $\alpha_0$ is derived by smoothing the increments of the Breslow
estimator. It is defined by

$$\hat\alpha_h^{\beta}(t) = \frac{1}{h} \int K\left(\frac{t - u}{h}\right) d\hat\Lambda_0(u, \beta), \qquad (3)$$
with $K : \mathbb{R} \to \mathbb{R}$ a kernel with integral 1, and $h$ a positive parameter called the bandwidth.
This estimator has been introduced and studied by Ramlau-Hansen (1983a;b) within the
framework of the multiplicative intensity model for counting processes, thereby extending
its use to censored survival data. Consistency and asymptotic normality are proven in
Ramlau-Hansen (1983b) with fixed bandwidth.

The choice of the bandwidth in kernel estimation is crucial, in particular when one is 
interested in establishing non-asymptotic adaptive inequalities. State-of-the-art methods 
are based on cross-validation. Ramlau-Hansen (1981) has suggested the cross-validation 
method to select the bandwidth but without any theoretical guarantees. For randomly 
censored survival data, Marron and Padgett (1987) have shown that the cross-validation 
method gives the optimal bandwidth for estimating the density: the ratio between the 
integrated squared error for the cross-validation bandwidth and the infimum of the
integrated squared error for any bandwidth almost surely converges to one. Gregoire (1993)
has considered the cross-validated method suggested by Ramlau-Hansen (1981) for the 
adaptive estimation of the intensity of a counting process and has proved some consistency 
and asymptotic normality results for the cross-validated kernel estimator. However, all the 
results for the adaptive kernel estimator with a cross-validated bandwidth are asymptotic. 

No non-asymptotic oracle inequalities have to date been stated for the kernel estimator
of the baseline function. In addition, to our knowledge, the construction of $\hat\alpha_h^{\hat\beta}$ has not yet
been considered for high-dimensional covariates. The objective of the present paper is then
twofold: whatever the dimension, we aim at proposing an estimator $\hat\alpha^{\hat\beta}$ of the baseline
function, for which we can establish a non-asymptotic oracle inequality to measure its
performance. The loss of prediction of $\|\hat\alpha^{\hat\beta} - \alpha_0\|_2$ when $p$ increases will be quantified.

To fulfill these two purposes, the idea is to estimate first the regression parameter $\beta_0$
via a Lasso procedure applied to the Cox partial log-likelihood, then to plug this estimator
in the usual kernel estimator (3) of the baseline hazard function and finally to select a
data-driven bandwidth, following a procedure adapted from Goldenshluger and Lepski
(2011). In the latter, the problem of bandwidth selection in kernel density estimation is
addressed and an adaptive estimator is derived, which satisfies non-asymptotic minimax
bounds. This method was then considered by Doumic et al. (2012) to estimate the
division rate of a size-structured population in a non-parametric setting, by Bouaziz et al.
(2013) to estimate the intensity function of a recurrent event process and by Chagny
(2014) for the estimation of a real function via a warped kernel strategy. In the present
paper, we consider it to obtain an adaptive kernel estimator of the baseline function
with a data-driven bandwidth. We establish the first adaptive and non-asymptotic oracle
inequality, which guarantees the theoretical performance of this kernel estimator. The
oracle inequality depends on the non-asymptotic control of $|\hat\beta - \beta_0|_1$ deduced from an
estimation inequality stated by Huang et al. (2013) and extended to the case of unbounded
counting processes (see Guilloux et al. (2015) for details).

The paper is organized as follows. In Section 3, we describe the two-step procedure
to estimate the baseline function: first, we describe the estimation of $\beta_0$ as a preliminary
step and give the bound for $|\hat\beta - \beta_0|_1$ and then we focus on the kernel estimation of $\alpha_0$
and describe the adaptive estimation procedure of Goldenshluger and Lepski to select a
data-driven bandwidth. In Section 4, we establish a non-asymptotic oracle inequality for
the adaptive kernel estimator. The fundamental proofs are gathered in Section 6. Lastly,
a supplementary material provides some technical results needed in the proofs.


2 Notations and preliminaries 

2.1 Framework with counting processes 

Consider the general setting of counting processes, which embeds the classical case of right
censoring. We follow here the now classical setting of Andersen et al. (1993) or Fleming
and Harrington (2011). For $n$ independent individuals, we observe for $i = 1, \dots, n$ a
counting process $N_i$, a random process $Y_i$ with values in $[0, 1]$ and a vector of covariates
$Z_i = (Z_{i1}, \dots, Z_{ip})^T \in \mathbb{R}^p$. Let $(\Omega, \mathcal{F}, P)$ be a probability space and $(\mathcal{F}_t)_{t \ge 0}$ be the
filtration defined by

$$\mathcal{F}_t = \sigma\{N_i(s), Y_i(s), 0 \le s \le t, Z_i, i = 1, \dots, n\}.$$

From the Doob-Meyer decomposition, we know that each $N_i$ admits a compensator
denoted by $\Lambda_i$, such that $M_i = N_i - \Lambda_i$ is a $(\mathcal{F}_t)_{t \ge 0}$ local square-integrable martingale (see
Andersen et al. (1993) for details). We assume in the following that $N_i$ satisfies an Aalen
multiplicative intensity model.

Assumption 2.1. For each $i = 1, \dots, n$ and all $t \ge 0$,

$$\Lambda_i(t) = \int_0^t \lambda_0(s, Z_i) Y_i(s)\,ds, \qquad (4)$$

where $\lambda_0(t, z) = \alpha_0(t) e^{\beta_0^T z}$, for $z \in \mathbb{R}^p$.

This general setting, introduced by Aalen (1980), embeds several particular examples
such as censored data, marked Poisson processes and Markov processes (see Andersen et al.
(1993) for further details). This framework generalizes the case considered in
Ramlau-Hansen (1983b) to unbounded counting processes and hence widens the scope of
applications: the jumps of the counting process may occur at times of relapse from a
disease in biomedical research, times of monetization in marketing, times of blogging in
social network study, etc.

2.2 Notations 

For a real number $q \ge 1$ and a function $f : \mathbb{R} \to \mathbb{R}$ such that $|f|^q$ is integrable and
bounded, we consider

$$\|f\|_{L^q(\mathbb{R})} = \left(\int_{\mathbb{R}} |f(x)|^q\,dx\right)^{1/q} \quad \text{and} \quad \|f\|_{\infty} = \sup_{x \in \mathbb{R}} |f(x)|.$$

The integrals and the supremum are restricted to the support of $f$ and for $\tau$ a positive
real number, we set $\|f\|_{\infty,\tau} = \sup_{x \in [0,\tau]} |f(x)|$ and we simply denote by $\|\cdot\|_2$ the $L^2$-norm
restricted to the interval $[0, \tau]$, so that

$$\|f\|_2^2 = \int_0^\tau f^2(x)\,dx.$$

For $h$ a positive real number, we define $f_h(\cdot) = f(\cdot/h)/h$. For square-integrable functions
$f$ and $g$ from $\mathbb{R}$ to $\mathbb{R}$, we denote the convolution product of $f$ and $g$ by $f * g$. For a vector
$b \in \mathbb{R}^p$ and a real $q \ge 1$, we denote $|b|_q = (\sum_{j=1}^p |b_j|^q)^{1/q}$.

For quantities $\gamma(n)$ and $\eta(n)$, the notation $\gamma(n) \lesssim \eta(n)$ means that there exists a
positive constant $c$ such that $\gamma(n) \le c\,\eta(n)$.

Finally, let $Z \in \mathbb{R}^p$ denote the generic vector of covariates with the same distribution
as the vectors of covariates $Z_i$ of each individual $i$, and let $Z_j$ denote its $j$-th component, namely
the $j$-th covariate of the vector $Z$.


3 Estimation procedure 

In this section, we describe the two-step procedure to estimate the baseline function. We
begin by recalling the usual estimation of the regression parameter $\beta_0$ in high dimension.
We then focus our study on the second step, which consists in the adaptive kernel
estimation of the baseline function $\alpha_0$.


3.1 Preliminary estimation of $\beta_0$

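The regression parameter $\beta_0$ is estimated via a Lasso procedure applied to the so-called
Cox partial log-likelihood introduced by Cox (1972) and defined, for all $\beta \in \mathbb{R}^p$, by

$$l_n^*(\beta) = \frac{1}{n} \sum_{i=1}^n \int_0^\tau \log \frac{e^{\beta^T Z_i}}{S_n(\beta, t)}\,dN_i(t), \quad \text{where} \quad S_n(\beta, t) = \frac{1}{n} \sum_{i=1}^n e^{\beta^T Z_i} Y_i(t). \qquad (5)$$

The estimator $\hat\beta$ of $\beta_0$ is then defined by

$$\hat\beta = \arg\min_{\beta \in B(0,R)} \{-l_n^*(\beta) + \mathrm{pen}(\beta)\}, \quad \text{with} \quad \mathrm{pen}(\beta) = \Gamma_n |\beta|_1, \qquad (6)$$

where $\Gamma_n$ is a positive regularization parameter to be suitably chosen and $B(0, R)$ is the
ball defined by

$$B(0, R) = \{b \in \mathbb{R}^p : |b|_1 \le R\}, \quad \text{with } R > 0.$$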

The ball constraint has already been considered by van de Geer (2008) or Kong and Nan
(2012). Roughly speaking, it means that we restrict our attention to a possibly
very large ball around $\beta_0$, for which the following (very mild) assumption is needed. It
is required to control the kernel estimator of the baseline function $\alpha_0$.

Assumption 3.1. We assume that $|\beta_0|_1 < +\infty$.
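
As an illustration of the penalized criterion (6), the following sketch evaluates the negative partial log-likelihood and its gradient for right-censored data and minimizes the $\ell_1$-penalized objective by proximal gradient descent (ISTA). This is only a sketch under stated assumptions: the simulations in this paper rely on the R package glmnet, not on this code; the function names, the fixed step size and the number of iterations are hypothetical choices; the ball constraint $B(0, R)$ is ignored and ties between observed times are assumed absent.

```python
import numpy as np

def neg_partial_loglik(beta, X, delta, Z):
    """Negative Cox partial log-likelihood (scaled by 1/n), right-censored data."""
    n = len(X)
    eta = Z @ beta
    at_risk = X[None, :] >= X[:, None]            # at_risk[i, j] = 1{X_j >= X_i}
    S = (at_risk * np.exp(eta)[None, :]).sum(axis=1) / n
    return -np.sum(delta * (eta - np.log(S))) / n

def grad_neg_partial_loglik(beta, X, delta, Z):
    """Gradient of neg_partial_loglik with respect to beta."""
    n = len(X)
    w = (X[None, :] >= X[:, None]) * np.exp(Z @ beta)[None, :]
    w = w / w.sum(axis=1, keepdims=True)          # risk-set weights, rows sum to 1
    return -(delta @ (Z - w @ Z)) / n

def soft_threshold(b, thr):
    """Proximal operator of thr * |.|_1."""
    return np.sign(b) * np.maximum(np.abs(b) - thr, 0.0)

def cox_lasso_ista(X, delta, Z, gamma, step=0.1, n_iter=500):
    """Proximal gradient iterations for -l_n(beta) + gamma * |beta|_1.

    The step size is an assumption: it should be at most the inverse of a
    Lipschitz constant of the gradient and may need tuning for large p.
    """
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        grad = grad_neg_partial_loglik(beta, X, delta, Z)
        beta = soft_threshold(beta - step * grad, step * gamma)
    return beta
```

When the regularization parameter exceeds the sup-norm of the gradient at zero, the iterations stay at the null vector, which is the usual behavior of the Lasso for large penalties.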
Concerning the covariates, we introduce the following assumption.

Assumption 3.2. There exists a positive constant $B$ such that for all $j \in \{1, \dots, p\}$,

$$|Z_j| \le B.$$

Assumption 3.2 is a classical assumption in the Cox model to obtain oracle inequalities 
(see Huang et al. (2013) and Bradic and Song (2012)) and seems reasonable since in 
practice 

We now give a general version of the estimation inequality of Theorem 3.1 of Huang
et al. (2013). We refer to Guilloux et al. (2015) for a proof of Proposition 3.3 in the
general case.

Proposition 3.3. Let $k > 0$, $c > 0$ and $s := \mathrm{Card}\{j \in \{1, \dots, p\} : \beta_{0j} \ne 0\}$ be the sparsity
index of $\beta_0$. Assume that $\|\alpha_0\|_{\infty,\tau} < \infty$. Then, under Assumptions 3.1 and 3.2, with
probability larger than $1 - cn^{-k}$, we have

$$|\hat\beta - \beta_0|_1 \le C(s) \sqrt{\frac{\log(pn^k)}{n}}, \qquad (7)$$

where $C(s) > 0$ is a constant depending on the sparsity index $s$.

In the rest of the paper, the conditions of Proposition 3.3 will be fulfilled, so that $\hat\beta$
satisfies Inequality (7). The assumption $\|\alpha_0\|_{\infty,\tau} < \infty$ can be found in Assumption 3.4.


3.2 Estimation of $\alpha_0$

In this subsection, we define the kernel estimator of the baseline hazard function $\alpha_0$ on
which our procedure relies. We state some functional and kernel assumptions, and we
describe the Goldenshluger and Lepski procedure to select a data-driven bandwidth.


3.2.1 Kernel estimator 

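We first recall the definition of the kernel estimator introduced by Ramlau-Hansen (1983b)
by using kernel functions to smooth the increments of the non-parametric Breslow
estimator (2) of the cumulative intensity.

Let $K : \mathbb{R} \to \mathbb{R}$ be a kernel, namely a function such that $\int_{\mathbb{R}} K(x)\,dx = 1$.
The usual kernel function estimator of $\alpha_0$ is then defined by

$$\hat\alpha_h^{\hat\beta}(t) = \frac{1}{nh} \sum_{i=1}^n \int_0^\tau K\left(\frac{t - u}{h}\right) \frac{1_{\{\bar Y(u) > 0\}}}{S_n(u, \hat\beta)}\,dN_i(u), \qquad (8)$$

with

$$\bar Y = \frac{1}{n} \sum_{i=1}^n Y_i \quad \text{and} \quad S_n(u, \beta) = \frac{1}{n} \sum_{i=1}^n e^{\beta^T Z_i} Y_i(u), \quad \text{for all } \beta \in \mathbb{R}^p.$$
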
The parameter $h > 0$ is called the bandwidth. In kernel function estimation, the
bandwidth has to be chosen by the user. Gregoire (1993) has defined a cross-validation
procedure for selecting the bandwidth for the smooth estimate of intensity in the Aalen counting
process. To our knowledge, all theoretical results for the kernel function estimator (8) with
a bandwidth selected by cross-validation are asymptotic. Cross-validation offers no
theoretical adaptive guarantees when the size of the panel $n$ is fixed and not very large, as
is the case for medical surveys where only a few patients can be observed. This explains
our interest in providing a data-driven method to select the bandwidth automatically
and obtain a kernel function estimator for which we can guarantee some non-asymptotic
properties.

In what follows, we denote by $\hat\alpha_h^{\hat\beta}$ the estimator under study, in which the Lasso
estimator (6) has been plugged.
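
To make the construction concrete in the right-censoring case, the kernel estimator can be sketched as follows: each uncensored observation contributes a Breslow increment equal to one over the sum of the relative risks of the individuals still at risk, and these increments are smoothed by the rescaled kernel. The code below is a minimal sketch, not the implementation used in this paper; the function names are hypothetical, ties are assumed absent, and the Epanechnikov kernel is only one possible choice.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1], with integral 1."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def kernel_baseline(t, X, delta, Z, beta, h, kernel=epanechnikov):
    """Kernel estimate of the baseline hazard at times t (right-censored data).

    Smooths the Breslow increments delta_i / sum_{j: X_j >= X_i} exp(beta' Z_j)
    with the rescaled kernel K((t - X_i) / h) / h.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    risk = np.exp(Z @ beta)                        # relative risks exp(beta' Z_j)
    at_risk = X[None, :] >= X[:, None]             # at_risk[i, j] = 1{X_j >= X_i}
    S = at_risk @ risk                             # risk-set sums at each X_i
    jumps = delta / S                              # Breslow increments
    weights = kernel((t[:, None] - X[None, :]) / h) / h
    return weights @ jumps
```

Since the kernel integrates to one, the integral of the estimate over the real line equals the sum of the Breslow increments, which gives a simple sanity check of an implementation.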

3.2.2 Functional and kernel assumptions 

Classical conditions are required on the intensity function and the kernel K. 

Assumption 3.4.

(i) For all $i \in \{1, \dots, n\}$, the random process $Y_i$ takes its values in $\{0, 1\}$.

(ii) For $S(t, \beta_0) = E[e^{\beta_0^T Z_i} Y_i(t)]$, there exists a positive constant $c_S$ such that

$$S(t, \beta_0) \ge c_S, \quad \forall t \in [0, \tau].$$

(iii) $\|\alpha_0\|_{\infty,\tau} := \sup_{t \in [0,\tau]} \alpha_0(t) < \infty$.

Assumption 3.4.(i) is satisfied for all the examples quoted in the introduction. In fact,
this assumption is needed to ensure that the random process $Y_i$ has a lower bound when
it is nonzero. We could also have considered a modified estimator of $S_n(u, \beta)$, defined
by (5), as it is usually done in the censoring case without covariates. Assumption 3.4.(ii)
is common in the context of estimation with censored observations (see Andersen et al.
(1993)). Assumption 3.4.(iii) is required to obtain Lemma 6.1 and Theorem 4.1 below.
Nevertheless, the value $\|\alpha_0\|_{\infty,\tau}$ is not needed to compute the estimator (see Section 5).

The following assumptions are fulfilled by many standard kernel functions and are 
standard in kernel function estimation. 

Assumption 3.5.

(i) $\|K\|_\infty = \sup_{u \in \mathbb{R}} |K(u)| < \infty$ and $\|K\|_2^2 = \int_{\mathbb{R}} K^2(u)\,du < \infty$.

(ii) $nh \ge 1$ and $0 < h < 1$.

(iii) The kernel $K$ is of order 1, i.e. for $j \in \{0, 1, 2\}$ the function $x \mapsto x^j K(x)$ is
integrable and

$$\int_{\mathbb{R}} x K(x)\,dx = 0 \quad \text{and} \quad \int_{\mathbb{R}} x^2 K(x)\,dx < \infty.$$

Assumptions 3.5.(i) and 3.5.(ii) are rather standard in kernel density estimation (see
Goldenshluger and Lepski (2011)) and have also been considered in kernel intensity
estimation by Bouaziz et al. (2013). Assumption 3.5.(iii) is only required to ensure that
$K_h * \alpha_0(t) \to \alpha_0(t)$ as $h \to 0$, for all $t \in [0, \tau]$.

Remark 3.6. In this paper, we do not assume that the kernel $K$ has a compact support, in
contrast to Bouaziz et al. (2013). The kernel estimator (8) and the pseudo-estimator
(19) are then well defined for all $t \in [0, \tau]$.

3.2.3 Collection of estimators 

Let $\mathcal{H}_n$ be a grid of bandwidths $h > 0$, satisfying the following assumptions:

Assumption 3.7.

(i) $\mathrm{Card}(\mathcal{H}_n) \le n$.

(ii) For some $a > 0$, $\sum_{h \in \mathcal{H}_n} \frac{1}{nh} \lesssim \log^a(n)$.

(iii) For all $b > 0$, $\sum_{h \in \mathcal{H}_n} \exp(-b/h) < +\infty$.

Assumptions 3.7.(i)-(iii) mean that the bandwidth collection should not be too large.
Let us give an example of grid $\mathcal{H}_n$ that satisfies the three previous assumptions.

Example 3.8 (Example of $\mathcal{H}_n$). The following grid is considered in the simulations in
Section 5:

$$\mathcal{H}_n = \left\{h_j = \frac{1}{2^j},\ j = 1, \dots, \lfloor \log(n)/\log(2) \rfloor\right\},$$

where $\varepsilon \in [0, \tau/2]$ is a small constant chosen arbitrarily as close as possible to 0. For this
grid, all the assumptions required on the bandwidths are verified. Indeed, $\mathrm{Card}(\mathcal{H}_n) \le
\log(n)/\log(2) \le n$ and for all $j = 1, \dots, \lfloor \log(n)/\log(2) \rfloor$, we have $h_j \in [n^{-1}, 1]$. Moreover,
Assumption 3.7.(ii) holds true since

$$\sum_{h \in \mathcal{H}_n} \frac{1}{nh} = \frac{1}{n} \sum_{j=1}^{\lfloor \log(n)/\log(2) \rfloor} 2^j = O(1).$$

Lastly,

$$\sum_{h \in \mathcal{H}_n} \exp(-b/h) = \sum_{j=1}^{\lfloor \log(n)/\log(2) \rfloor} e^{-b 2^j} = O(1)$$

and Assumption 3.7.(iii) is verified.

On the grid $\mathcal{H}_n$, we obtain a set of kernel estimators $\mathcal{F}(\mathcal{H}_n) = \{\hat\alpha_h^{\hat\beta}, h \in \mathcal{H}_n\}$ of the
baseline function $\alpha_0$ from the definition (8).
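
The three conditions on the bandwidth collection are easy to check numerically for a dyadic grid. The following sketch builds the bandwidths $2^{-j}$ for $j$ from 1 to the integer part of $\log(n)/\log(2)$ and reports the quantities entering the conditions; the function names and the default value of $b$ are hypothetical and only serve the illustration.

```python
import math

def dyadic_grid(n):
    """Bandwidths h_j = 2**(-j) for j = 1, ..., floor(log(n) / log(2))."""
    J = int(math.floor(math.log(n) / math.log(2)))
    return [2.0 ** (-j) for j in range(1, J + 1)]

def check_grid(grid, n, b=1.0):
    """Quantities appearing in the conditions on the bandwidth collection."""
    return {
        "card_le_n": len(grid) <= n,                          # cardinality at most n
        "nh_ge_1": all(n * h >= 1.0 for h in grid),           # n * h >= 1 for every h
        "sum_inv_nh": sum(1.0 / (n * h) for h in grid),       # stays bounded in n
        "sum_exp": sum(math.exp(-b / h) for h in grid),       # stays bounded in n
    }
```

For instance, with $n = 200$ the grid contains the seven bandwidths $1/2, \dots, 1/128$, and the sum of $1/(nh)$ over the grid equals $254/200 = 1.27$.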



3.2.4 Adaptive selection of the bandwidth 

We wish to automatically select a relevant bandwidth $h \in \mathcal{H}_n$, in order to
select a kernel estimator among the set $\mathcal{F}(\mathcal{H}_n)$. As usual, we must choose
a bandwidth $h$ which realizes the best compromise between the squared-bias and the
variance terms. The choice should be data-driven. For this, we use a quite recent method
introduced by Goldenshluger and Lepski (2011) for the problem of density estimation. The
"Goldenshluger and Lepski method" has only been considered in two different settings:
Bouaziz et al. (2013) have applied this method to provide an adaptive kernel function
estimator of the intensity function of a recurrent event process and Chagny (2014) has
used it to estimate a real-valued function from a sample of random couples.
Lately, Chagny (2013) has also proposed a "mixed strategy", which consists in
applying the "Goldenshluger and Lepski method" to select the relevant model in model
selection methods for real-valued functions in regression models. We consider this method
to obtain an adaptive kernel function estimator of the baseline function, for which we
establish a non-asymptotic oracle inequality.

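Let us begin by describing the method. We can explain the idea of the method of
Goldenshluger and Lepski (2011) from a heuristic proposed by Chagny (2013). We want
to define $\hat h^{\hat\beta}$ so that the risk of the resulting estimator is as close as possible to

$$\min_{h \in \mathcal{H}_n} \{\|\alpha_0 - K_h * \alpha_0\|_2^2 + V(h)\},$$

with

$$V(h) = \kappa \frac{\|\alpha_0\|_{\infty,\tau}}{c_S^2} \left(\|\alpha_0\|_{\infty,\tau} E[e^{2\beta_0^T Z_1}] \tau + E[e^{\beta_0^T Z_1}]\right) \frac{\|K\|_{L^2(\mathbb{R})}^2}{nh},$$

for a constant $\kappa > 0$. In order to approximate the bias term $\|\alpha_0 - K_h * \alpha_0\|_2^2$, we replace
$\alpha_0$ with an estimator $\hat\alpha_{h'}^{\hat\beta}$ (with a fixed bandwidth $h'$), so that we obtain $\|\hat\alpha_{h'}^{\hat\beta} - K_h * \hat\alpha_{h'}^{\hat\beta}\|_2^2$.
However, unlike the bias term, this quantity is random and thus contains some variability.
We need to correct this variability by deducting the part of the variance $V(h')$. Lastly,
since there is no reason to choose one bandwidth $h' \in \mathcal{H}_n$ rather than another one, we
consider the entire collection and take the maximum over this collection.

Formally, we define for $h \in \mathcal{H}_n$

$$A^{\hat\beta}(h) = \sup_{h' \in \mathcal{H}_n} \left\{\|\hat\alpha_{h,h'}^{\hat\beta} - \hat\alpha_{h'}^{\hat\beta}\|_2^2 - V(h')\right\}_+, \qquad (9)$$

where

$$\hat\alpha_{h,h'}^{\hat\beta}(t) = K_{h'} * \hat\alpha_h^{\hat\beta}(t), \qquad (10)$$

for any $t > 0$ and $h, h'$ two positive real numbers, and

$$V(h) = \kappa \frac{\|\alpha_0\|_{\infty,\tau}}{c_S^2} \left(\|\alpha_0\|_{\infty,\tau} E[e^{2\beta_0^T Z_1}] \tau + E[e^{\beta_0^T Z_1}]\right) \frac{\|K\|_{L^2(\mathbb{R})}^2}{nh}, \qquad (11)$$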
for some numerical constant $\kappa > 0$. A data-driven equivalent of this variance term is
given in Section 5. The choice of $\kappa$ is also discussed.

From these definitions, we deduce the following choice of the bandwidth:

$$\hat h^{\hat\beta} = \arg\min_{h \in \mathcal{H}_n} \{A^{\hat\beta}(h) + V(h)\}. \qquad (12)$$

Our adaptive kernel estimator is then $\hat\alpha_{\hat h^{\hat\beta}}^{\hat\beta}$.
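
The selection rule can be sketched numerically once the estimators of the collection have been evaluated on a regular time grid. In the sketch below, the convolutions and the squared norms are approximated by Riemann sums restricted to the grid, which is an assumption of the illustration (values of the estimators outside the grid are ignored); the function names are hypothetical, and the variance term is passed as a user-supplied function since its theoretical form involves unknown quantities.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel, supported on [-1, 1], with integral 1."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def gl_select(estimates, bandwidths, t_grid, V, kernel=epanechnikov):
    """Goldenshluger-Lepski type bandwidth selection on a regular grid.

    estimates[k] holds the values on t_grid of the estimator with bandwidth
    bandwidths[k]; V is a callable h -> variance-type term.
    Returns the selected bandwidth and the criterion A(h) + V(h) for each h.
    """
    dt = t_grid[1] - t_grid[0]
    diff = t_grid[:, None] - t_grid[None, :]
    crit = []
    for k, h in enumerate(bandwidths):
        A = 0.0                                        # positive part of the supremum
        for kp, hp in enumerate(bandwidths):
            K_hp = kernel(diff / hp) / hp
            smoothed = (K_hp @ estimates[k]) * dt      # K_{h'} * estimate_h on the grid
            sq_dist = np.sum((smoothed - estimates[kp]) ** 2) * dt
            A = max(A, sq_dist - V(hp))
        crit.append(A + V(h))
    return bandwidths[int(np.argmin(crit))], crit
```

When the variance term dominates every squared distance, the first term vanishes and the rule returns the largest bandwidth, as expected from the bias-variance interpretation.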


4 Non-asymptotic bounds for the kernel estimator 


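Now, let us state the main theorems of the paper, which provide the first non-asymptotic
oracle inequality for the adaptive kernel baseline estimator in high-dimension.


Theorem 4.1. Under Assumptions 3.1, 3.2, 3.4.(i)-(iii) and 3.5.(i)-(iii), if $\mathcal{H}_n$ is a finite
discrete set of bandwidths such that 3.7.(i)-(iii) are satisfied, then there exists a constant
$\kappa$ such that $\hat\alpha_{\hat h^{\hat\beta}}^{\hat\beta}$ defined by (11), (9) and (12) satisfies for $n$ large enough and $k > 12$:

$$E\left[\|\hat\alpha_{\hat h^{\hat\beta}}^{\hat\beta} - \alpha_0\|_2^2\right] \le C \inf_{h \in \mathcal{H}_n} \left\{\|\alpha_h - \alpha_0\|_2^2 + V(h)\right\} + C'(s) \frac{\log^a(n) \log(pn^k)}{n}, \qquad (13)$$

with

$$V(h) = \kappa \frac{\|\alpha_0\|_{\infty,\tau}}{c_S^2} \left(\|\alpha_0\|_{\infty,\tau} E[e^{2\beta_0^T Z_1}] \tau + E[e^{\beta_0^T Z_1}]\right) \frac{\|K\|_{L^2(\mathbb{R})}^2}{nh}, \qquad (14)$$

where $\alpha_h = K_h * \alpha_0$, $C$ is a numerical constant, and $C'(s)$ is a constant depending on $\tau$, $B$, $|\beta_0|_1$, $R$,
$\|\alpha_0\|_{\infty,\tau}$, $c_S$, $\|K\|_{L^1(\mathbb{R})}$, $\|K\|_{L^2(\mathbb{R})}$ and on the sparsity index $s$ of $\beta_0$.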


This inequality ensures that the adaptive kernel estimator $\hat\alpha_{\hat h^{\hat\beta}}^{\hat\beta}$ automatically makes
the squared-bias/variance compromise. The selected bandwidth $\hat h^{\hat\beta}$ performs as well
as the unknown oracle:

$$h_{oracle} := \arg\min_{h \in \mathcal{H}_n} E[\|\hat\alpha_h^{\hat\beta} - \alpha_0\|_2^2],$$

up to the multiplicative constant $C$ and up to a remaining term of order $\log^a(n) \log(pn^k)/n$,
which is negligible. In Inequality (13), the infimum term is classic in such oracle
inequalities for kernel estimators: the bias term $\|\alpha_h - \alpha_0\|_2^2$ decreases when $h$ decreases and the
variance term $V(h)$ increases when $h$ decreases. The remaining terms are of order

$$\frac{\log^a(n) \log(pn^k)}{n} = \frac{k \log^{a+1}(n)}{n} + \frac{\log^a(n) \log(p)}{n}.$$


Chagny (2014), in the context of an additive regression model, has established an oracle
inequality for the kernel estimator of the real-valued regression function with a remaining
term of order $1/n$. In the context of the estimation of the intensity of a recurrent event
process observed under a standard censoring scheme but without covariates, Bouaziz et al.
(2013) have a logarithmic term which appears in their oracle inequality with a remaining
term of order $\log^{1+a}(n)/n$ instead of the expected $1/n$. This logarithmic term comes from
the control in $\log(n)/n$ between the distribution function $G$ and its modified
Kaplan-Meier estimator $\hat G$, which appears in the kernel intensity estimator. The exponent $a$ in
the remaining term arises from Assumption 3.7.(ii), which is needed for the control of the
difference between the kernel intensity estimator involving $\hat G$ and a pseudo-estimator that
does not depend on $\hat G$. As in Bouaziz et al. (2013), our kernel estimator depends on
another estimator, so that we need Assumption 3.7.(ii) in order to control the difference
between the kernel estimator (8) and the pseudo-estimator (19). If our kernel estimator
had not involved another estimator, we would have considered the condition $\sum_{h \in \mathcal{H}_n} 1/h \le k_0 n^{a_0}$, as
in Chagny (2014), instead of Assumption 3.7.(ii). The term in $\log(p)/n$ in the remaining
term comes from the control of $|\hat\beta - \beta_0|_1$ given by Proposition 3.3. This term is typical
of the estimation of the regression parameter $\beta_0$ when the number of covariates is large.
There is no hope of recovering the usual rates in this high-dimensional setting, but the
loss in the variance term is only of order $\log p/n$.

When we assume that the counting processes $N_i$ are bounded for $i = 1, \dots, n$, the
variance term $V(h)$ is simpler and has the same form as the variance term in Bouaziz
et al. (2013). In this particular case, Theorem 4.1 takes the following form.


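Theorem 4.2. Under the same assumptions as in Theorem 4.1 and assuming also that
there exists $c_\tau > 0$, such that $N_i(t) \le c_\tau$ almost surely for every $t \in [0, \tau]$ and $i \in \{1, \dots, n\}$,
there exists a constant $\kappa$ such that $\hat\alpha_{\hat h^{\hat\beta}}^{\hat\beta}$ defined by (9), (12) and

$$V_b(h) = \kappa \frac{c_\tau \tau \|\alpha_0\|_{\infty,\tau}}{c_S} \frac{\|K\|_{L^2(\mathbb{R})}^2}{nh} \qquad (15)$$

satisfies for $n$ large enough:

$$E\left[\|\hat\alpha_{\hat h^{\hat\beta}}^{\hat\beta} - \alpha_0\|_2^2\right] \le C \inf_{h \in \mathcal{H}_n} \left\{\|\alpha_h - \alpha_0\|_2^2 + V_b(h)\right\} + C'(s) \frac{\log^a(n) \log(np)}{n}, \qquad (16)$$

where $C$ is a numerical constant, $C'(s)$ a constant depending on $\tau$, $c_\tau$, $B$, $|\beta_0|_1$, $R$,
$\|\alpha_0\|_{\infty,\tau}$, $c_S$, $\|K\|_{L^1(\mathbb{R})}$, $\|K\|_{L^2(\mathbb{R})}$ and on the sparsity index $s$ of $\beta_0$.

The proof of Theorem 4.2 is close to the one of Theorem 4.1 and we refer to Lemler
(2014) for the details.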


5 Applications 

5.1 Simulation study 

The aim of this subsection is to illustrate the behavior of the kernel estimator $\hat\alpha_{\hat h^{\hat\beta}}^{\hat\beta}$ of the
baseline function in the case of right censoring and to compare it with the usual kernel
estimator with a bandwidth selected by cross-validation introduced by Ramlau-Hansen
(1983b).

5.1.1 Simulated data: censored data

We consider a cohort of size $n$ and $p$ covariates simulated according to the Cox model (1)
with right censoring and with regression parameter $\beta_0$ chosen as a vector of dimension $p$,
defined by

$$\beta_0 = (0.1, 0.3, 0.5, 0, \dots, 0)^T \in \mathbb{R}^p,$$

for various $p \ge 3$. Several choices of $n$ and $p$ have been considered, with sample size
$n$ taking values $n = 200$ and $n = 500$ and $p$ varying between $p = \sqrt{n}$, being 15 and
22 respectively, and $p = n$, referred to as the high-dimension case. For each $n$ and
$p$, the design matrix $Z = (Z_{ij})_{1 \le i \le n, 1 \le j \le p}$ is simulated independently from a uniform
distribution on $[-1, 1]$ and survival times $T_i$, $i = 1, \dots, n$, are simulated according to Weibull
distributions $W(1.5, 1)$ and $W(0.5, 2)$. Hence, the associated baseline function has the
form $\alpha_0(t) = a \lambda^a t^{a-1}$, where $a$ and $\lambda$ stand for parameters in $W(a, \lambda)$. The censoring
times $C_i$, for $i = 1, \dots, n$, are simulated independently from the survival times via an
exponential distribution $\mathcal{E}(1/(\gamma E[T_1]))$, where $\gamma$ is adjusted to the chosen rate of censorship:
$\gamma = 4.5$ for 20% of censorship and $\gamma = 1.2$ for 50% of censorship.

The time $\tau$ of the end of the study is taken as the quantile at 90% of $(T_i \wedge C_i)_{i=1,\dots,n}$.
For $i = 1, \dots, n$, we compute the observed times $X_i = \min(T_i, \tilde C_i)$, where $\tilde C_i = C_i \wedge \tau$ and
the censoring indicators $\delta_i = 1_{\{T_i \le \tilde C_i\}}$. The definition of $\tilde C_i$ ensures that there exist some
$i \in \{1, \dots, n\}$ for which $X_i \ge \tau$, so that all estimators are defined on the interval $[0, \tau]$
and it prevents a certain edge effect. Each sample $(Z_i, T_i, C_i, X_i, \delta_i, i = 1, \dots, n)$ is
repeatedly simulated $N_e = 100$ times.
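
The simulation design described above can be sketched as follows. The survival times are drawn by inverting the cumulative hazard $(\lambda t)^a e^{\beta_0^T Z}$, which corresponds to the Weibull baseline $a \lambda^a t^{a-1}$. This is a sketch under stated assumptions and not the code used for the reported results: the function name is hypothetical, and the expectation of the survival time in the censoring rate is replaced by the sample mean of the simulated times.

```python
import numpy as np

def simulate_cox_weibull(n, p, a, lam, gamma, rng):
    """Simulate right-censored data from a Cox model with Weibull baseline.

    Returns observed times X, indicators delta, design Z, beta0 and tau.
    """
    beta0 = np.zeros(p)
    beta0[:3] = [0.1, 0.3, 0.5]
    Z = rng.uniform(-1.0, 1.0, size=(n, p))
    # Inversion: (lam * T)**a * exp(beta0' Z) follows a standard exponential law
    E = rng.exponential(size=n)
    T = (E / np.exp(Z @ beta0)) ** (1.0 / a) / lam
    # Exponential censoring with mean gamma * E[T]; E[T] is replaced here
    # by the sample mean (assumption of this sketch)
    C = rng.exponential(scale=gamma * T.mean(), size=n)
    tau = np.quantile(np.minimum(T, C), 0.9)       # end of study
    C_tilde = np.minimum(C, tau)
    X = np.minimum(T, C_tilde)
    delta = (T <= C_tilde).astype(int)
    return X, delta, Z, beta0, tau
```

A call such as `simulate_cox_weibull(200, 15, 1.5, 1.0, 4.5, np.random.default_rng(1))` mimics one replication of the setting with $n = 200$, $p = 15$ and the lighter censoring level.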

The compared estimators of the baseline hazard function are both constructed with
the Epanechnikov kernel, defined by

$$K(u) = \frac{3}{4}(1 - u^2) 1_{\{|u| \le 1\}}.$$

In both cases we plug in the Lasso regression parameter estimator $\hat\beta$ defined by (6) and
computed with the R package glmnet.
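
One can check numerically that the Epanechnikov kernel fulfills the kernel assumptions of Section 3.2.2: it integrates to one, its first moment vanishes, and its second moment, squared $L^2$-norm and supremum are finite (equal to $1/5$, $3/5$ and $3/4$ respectively). The sketch below uses a trapezoidal rule on $[-1, 1]$; the function names and the number of grid points are hypothetical choices.

```python
import numpy as np

def epanechnikov(u):
    """Epanechnikov kernel K(u) = 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    u = np.asarray(u, dtype=float)
    return 0.75 * (1.0 - u**2) * (np.abs(u) <= 1.0)

def kernel_moments(kernel, lo=-1.0, hi=1.0, m=200001):
    """Mass, first and second moments, squared L2-norm and supremum of a kernel."""
    u = np.linspace(lo, hi, m)
    du = u[1] - u[0]
    k = kernel(u)

    def integrate(f):
        # trapezoidal rule on the regular grid u
        return float(np.sum((f[1:] + f[:-1]) / 2.0) * du)

    return {
        "mass": integrate(k),
        "first": integrate(u * k),
        "second": integrate(u**2 * k),
        "l2_sq": integrate(k**2),
        "sup": float(k.max()),
    }
```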

We compare two procedures for the data-driven choice of $h$: the Goldenshluger and
Lepski method with the selected bandwidth denoted by $\hat h_{GL}^{\hat\beta}$ and the cross-validation with
the selected bandwidth denoted by $\hat h_{CV}^{\hat\beta}$.

5.1.2 The Goldenshluger and Lepski method 

The adaptive bandwidth selection method we consider here is based on the grid of
bandwidths $\mathcal{H}_n$ defined in Example 3.8 by

$$\mathcal{H}_n = \{1/2^k,\ k = 0, \dots, \lfloor \log(n)/\log(2) \rfloor\}.$$

In our procedure (9), the variance term $V(h)$ involves unknown quantities, so we consider
a data-driven equivalent of it and use that the right-censoring context implies that the
counting processes are bounded. Hence we implement the following procedure:

$$\hat h_{GL}^{\hat\beta} = \arg\min_{h \in \mathcal{H}_n} \{A^{\hat\beta}(h) + \hat V_b(h)\},$$

where, for any $t > 0$ and $h, h'$ two positive real numbers,

$$A^{\hat\beta}(h) = \sup_{h' \in \mathcal{H}_n} \left\{\|\hat\alpha_{h,h'}^{\hat\beta} - \hat\alpha_{h'}^{\hat\beta}\|_2^2 - \hat V_b(h')\right\}_+, \qquad \hat\alpha_{h,h'}^{\hat\beta}(t) = K_{h'} * \hat\alpha_h^{\hat\beta}(t),$$

and

$$\hat V_b(h) = \kappa' \|\hat\alpha_{\max(h)}^{\hat\beta}\|_{\infty,\tau} \frac{\|K\|_{L^2(\mathbb{R})}^2}{nh}.$$

In the variance term $\hat V_b(h)$, we have replaced the true unknown function $\alpha_0$ by an
estimator computed for the largest bandwidth $h$ in the grid $\mathcal{H}_n$ (see Bouaziz et al.
(2013)). The numerical constant $\kappa'$ is a universal constant that we tuned from the
comparison of the MISEs for several candidate values in the range $10^{-4}$ to $1000$, and for the
two different distributions $W(1.5, 1)$ and $W(0.5, 2)$. We take $\kappa' = 1$.


5.1.3 Cross-validation method 

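The bandwidth $\hat h_{CV}^{\hat\beta}$ selected by cross-validation is defined by:

$$\hat h_{CV}^{\hat\beta} = \arg\min_{h} \left\{\int_0^\tau (\hat\alpha_h^{\hat\beta}(t))^2\,dt - 2 \sum_{i \ne j} \frac{1}{h} K\left(\frac{X_i - X_j}{h}\right) \frac{\delta_i}{Y(X_i)} \frac{\delta_j}{Y(X_j)}\right\},$$

where $Y = \sum_{i=1}^{n} \dots$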

5.1.4 Performances 

The performances of these two estimators are evaluated via different Integrated Squared
Errors (ISE). For some function $\alpha \in L^2([0, \tau])$ the standard ISE and the total ISE are
respectively defined by

$$ISE_{stand}(\alpha) = \int_0^\tau (\alpha(t) - \alpha_0(t))^2\,dt,$$

$$ISE_{total}(\alpha, \beta) = \frac{1}{n} \sum_{i=1}^n \int_0^\tau \left(\alpha(t) e^{\beta^T Z_i} - \alpha_0(t) e^{\beta_0^T Z_i}\right)^2 dt.$$

The associated Mean Integrated Squared Errors are defined by $MISE_g(\alpha) = E[ISE_g(\alpha)]$,
for $g$ = stand or total, where the expectation is taken on $(T_i, C_i, Z_i)$ (for the sake of simplicity,
we write $MISE_g(\alpha)$ even if the MISE depends on $\beta$). We obtain an estimate of the
different MISEs by taking the empirical mean over the $N_e = 100$ replications.
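
Both integrated squared errors can be approximated on a time grid covering $[0, \tau]$. The following sketch uses a trapezoidal rule; the function names are hypothetical, and the estimated and true baseline functions are assumed to be given by their values on the grid.

```python
import numpy as np

def trapezoid(y, t):
    """Trapezoidal rule along the last axis of y, on the grid t."""
    return np.sum((y[..., 1:] + y[..., :-1]) / 2.0 * np.diff(t), axis=-1)

def ise_standard(alpha_hat, alpha0, t):
    """Standard ISE: integral over the grid of (alpha_hat - alpha0)^2."""
    return float(trapezoid((alpha_hat - alpha0) ** 2, t))

def ise_total(alpha_hat, alpha0, beta_hat, beta0, Z, t):
    """Total ISE: average over individuals of the ISE of the full intensity."""
    lam_hat = np.exp(Z @ beta_hat)[:, None] * alpha_hat[None, :]
    lam0 = np.exp(Z @ beta0)[:, None] * alpha0[None, :]
    return float(np.mean(trapezoid((lam_hat - lam0) ** 2, t)))
```

Averaging these quantities over the replications gives the empirical counterparts of the two MISEs.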
                            20% censoring                      50% censoring
                      MISEstand        MISEtot           MISEstand        MISEtot
Dimensions            GL      CV       GL      CV        GL      CV       GL      CV
n = 200   p = 15      0.014   0.017    0.080   0.082     0.023   0.029    0.104   0.120
          p = 500     0.013   0.016    0.117   0.117     0.022   0.026    0.152   0.154
n = 500   p = 22      0.009   0.007    0.038   0.035     0.011   0.012    0.055   0.056
          p = 1000    0.008   0.008    0.068   0.064     0.011   0.013    0.094   0.096


Table 1: MISEs of the kernel estimators of the baseline function with a bandwidth selected
by the Goldenshluger and Lepski method (GL, first column for each MISE) and with a
bandwidth selected by cross-validation (CV, second column for each MISE), with a Lasso
estimator of the regression parameter, given two rates of censoring: 20% and 50% of censoring.


5.1.5 Results 

Table 1 gives the two empirical MISEs of the kernel estimators with a bandwidth selected
either by cross-validation or by the Goldenshluger and Lepski method for a Lasso
estimator of the regression parameter and survival times that are distributed from $W(1.5, 1)$, in
different censoring situations. We consider the results for two rates of censoring: a usual
rate of 20% of censoring and a large rate of 50% of censoring.

As expected, with both procedures, the MISEs are degraded when the censoring rate
increases. When we compare the standard and total MISEs, the results are better for
the standard MISE. This is consistent, since the total MISE measures the performances of
the complete intensity estimators $\hat\lambda(t, Z) = \hat\alpha_{\hat h}^{\hat\beta}(t) e^{\hat\beta^T Z}$, including the error coming from $\hat\beta$,
whereas the standard MISE measures the performances of the estimators of the baseline
function. Therefor 

One can see that the MISEs resulting from the two procedures are very similar, with
rather good results for our procedure.

In Table 2, we give the standard MISE of the kernel estimators with a bandwidth 
selected either by cross-validation or by the Goldenshluger and Lepski method for different 
distributions of the survival times. We observe that the kernel estimator with a bandwidth 
selected by the Goldenshluger and Lepski method performs better than the one with a 
bandwidth selected by cross-validation for the two Weibull distributions. 

5.2 Application to a real dataset on breast cancer 

In this section, we apply the proposed method to study the relapse-free survival (RFS)
from breast cancer, adjusted for high-dimensional covariates, in two groups of patients. We
consider a Cox model (1) to link the RFS to the covariates. We aim at answering the
two questions of the introduction concerning the biomarkers that influence the RFS and
the prediction of the RFS for each individual. The dataset is available on the website
www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE6532.
Distributions                W(1.5,1)            W(0.5,2)
Dimensions                 GL       CV         GL       CV

n = 200    p = 15         0.056    0.088      1.02     1.561
           p = 200        0.06     0.085      0.923    1.556
n = 500    p = 22         0.025    0.037      1.006    1.521
           p = 500        0.027    0.033      1.098    1.515

(GL: Goldenshluger and Lepski method; CV: cross-validation.)


Table 2: MISEs for the kernel estimators with a bandwidth selected by the Goldenshluger
and Lepski method (first column for each distribution) and with a bandwidth selected 
by cross-validation (second column for each distribution), with a Lasso estimator of the 
regression parameter for two different Weibull distributions of the survival times. 


The dataset consists of 414 patients in the cohort GSE6532 collected by Loi et al.
(2007) for the purpose of characterizing Estrogen Receptor (ER)-positive subtypes with
gene expression profiles. Estrogen receptors are a group of proteins found inside cells,
which are activated by the hormone estrogen. There are different forms of estrogen
receptors, referred to as subtypes of estrogen receptors. When they are overexpressed, they are
referred to as ER-positive. The dataset has been studied from a survival analysis point
of view in Tian et al. (2012). Following them, we apply the two procedures to the same
survival time of interest (the RFS). After excluding patients with incomplete information, as
done by Loi et al. (2007), there remain 142 patients receiving Tamoxifen and 104 untreated
patients. Missing data would deserve a more careful treatment, but in this study we also
simply exclude the patients with missing data. In addition to clinical
information such as the age or the size of the tumor, we have 44 928 gene expression
measurements for each of the 246 patients. Two different survival times are available in
this study: the time of relapse-free survival and the time of distant metastasis-free
survival. We are interested here in the time of relapse-free survival, which is subject
to right censoring due to incomplete follow-up. The censoring rate is 60% in the group
of untreated patients and 66% in the group of patients receiving Tamoxifen. Our goal
is to compare the baseline functions in the two groups of patients: the patients receiving
Tamoxifen and the untreated patients.

We start with a preliminary variable selection among the 44 928 levels of gene expression.
This corresponds to a screening step (see Fan et al. (2010)). This preliminary variable
selection is based on the score statistic of the Cox model fitted to each variable
separately. We only keep the variables whose score statistics exceed a threshold.
The difference from the procedure proposed by Fan et al. (2010) is that we fix the number
of covariates we want to keep and then tune a threshold to select this number of
covariates. We define the threshold as the 95th percentile of a chi-squared distribution
with 1 degree of freedom, so that 996 probesets are selected; with the clinical
covariates, we have p = 1000.
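
The screening step described above can be sketched as follows. This is an illustrative reconstruction rather than the authors' implementation: it assumes untied survival times, uses the standard score test of a univariate Cox model at $\beta = 0$, and the names `cox_score_statistic`, `screen_covariates` and `CHI2_95_DF1` are hypothetical.

```python
import numpy as np

# 95th percentile of a chi-squared law with 1 degree of freedom (1.959964^2).
CHI2_95_DF1 = 3.841458820694124

def cox_score_statistic(time, event, z):
    """Score test statistic U^2 / I of a univariate Cox model at beta = 0.

    Assumes no tied times; `event` is 1 for an observed event, 0 if censored.
    """
    order = np.argsort(time)
    event = np.asarray(event, dtype=float)[order]
    z = np.asarray(z, dtype=float)[order]
    n = len(z)
    # After sorting, the risk set of subject i is {i, ..., n-1}.
    size = np.arange(n, 0, -1).astype(float)
    mean = np.cumsum(z[::-1])[::-1] / size
    var = np.cumsum((z ** 2)[::-1])[::-1] / size - mean ** 2
    score = np.sum(event * (z - mean))   # U(0)
    info = np.sum(event * var)           # I(0)
    return float(score ** 2 / info)

def screen_covariates(time, event, Z, threshold=CHI2_95_DF1):
    """Indices of the columns of Z whose score statistic exceeds the threshold."""
    stats = np.array([cox_score_statistic(time, event, Z[:, j])
                      for j in range(Z.shape[1])])
    return np.flatnonzero(stats > threshold)
```

To keep a fixed number of covariates instead, as the text describes, one could sort the statistics and keep the largest ones; the default threshold here is simply the chi-squared quantile mentioned in the text.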

Figure 1 shows the graphs of the kernel estimators of the baseline function with a
bandwidth selected by cross-validation and by the Goldenshluger and Lepski method, in
the two groups of patients for p = 1000.

[Figure 1, two panels: (a) Untreated patients (p=1000). (b) Tamoxifen patients (p=1000).]

Figure 1: Kernel estimator with a bandwidth selected by cross-validation (in blue) and
kernel estimator with a bandwidth selected with the Goldenshluger and Lepski method
(in red). Panel (a) corresponds to the group of untreated patients and panel (b) to the
group of Tamoxifen patients, for p = 1000.

In Figure 1, we observe that the estimator obtained by cross-validation fails to give
an interpretable estimate of $\alpha_0$ for the untreated patients. For the estimator obtained
from the Goldenshluger and Lepski method, we observe that the risk of breast cancer
relapse is delayed under treatment: the estimated baseline function is
close to 0 until t = 2.5 for the patients treated with Tamoxifen, whereas it already increases
at time t = 1.5 for the untreated patients. This suggests that the treatment
has a positive influence on the survival time.

6 Proofs 

This section is organized as follows. First, we establish a lemma that controls
the estimation error of the kernel estimator for a fixed bandwidth h. Then we prove
Theorem 4.1 from two fundamental lemmas that are also proved in this section. The
other technical lemmas, which are not essential for a first reading, are gathered in
the supplementary material.
6.1 Intermediate lemma: bound for the kernel estimator of $\alpha_0$ with a fixed bandwidth

We first establish a non-asymptotic global bound on the Mean Integrated Squared Error
(MISE) of the estimators $\hat\alpha_h^{\hat\beta}$, with h fixed.

Lemma 6.1. Under Assumptions 3.1, 3.2, 3.4.(ii)-(iii) and 3.5.(i)-(iii), for a fixed h > 0,
n large enough and k > 12,

$$\mathbb{E}\big[\|\hat\alpha_h^{\hat\beta} - \alpha_0\|_2^2\big] \le 2\|\alpha_h - \alpha_0\|_2^2 + \frac{C_1}{nh} + C_2(s)\frac{\log(n^k p)}{n}, \tag{18}$$

where $C_1$ is a constant depending on $\tau$, $\|\alpha_0\|_{\infty,\tau}$, $c_S$, $\mathbb{E}[e^{\beta_0^T Z_1}]$, $\mathbb{E}[e^{2\beta_0^T Z_1}]$ and $\|K\|_{L^2(\mathbb{R})}$,
and $C_2(s)$ is a constant depending on $B$, $|\beta_0|_1$, $R$, $\|\alpha_0\|_{\infty,\tau}$, $c_S$, $\tau$, $\|K\|_{L^2(\mathbb{R})}$ and on the
sparsity index $s$ of $\beta_0$.

To prove this lemma and link the kernel estimator to the true baseline function $\alpha_0$,
the trick is to introduce a pseudo-estimator, which does not depend on $\hat\beta$. Consider for
h > 0 the pseudo-estimator

$$\bar\alpha_h(t) = \frac{1}{nh}\sum_{i=1}^n \int_0^\tau K\Big(\frac{t-u}{h}\Big)\frac{1}{S(u,\beta_0)}\,dN_i(u), \tag{19}$$

which corresponds to the kernel estimator of $\alpha_0$ when $S(u,\beta_0) = \mathbb{E}[e^{\beta_0^T Z_i} Y_i(u)]$ is known.
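To justify the choice of the pseudo-estimator, let us calculate its expectation:

$$\mathbb{E}[\bar\alpha_h(t)] = \frac{1}{nh}\sum_{i=1}^n \int_0^\tau K\Big(\frac{t-u}{h}\Big)\frac{\alpha_0(u)\,\mathbb{E}[e^{\beta_0^T Z_i} Y_i(u)]}{S(u,\beta_0)}\,du = \frac{1}{h}\int_0^\tau K\Big(\frac{t-u}{h}\Big)\alpha_0(u)\,du = K_h * \alpha_0(t),$$

which approximates $\alpha_0$: the family $(K_h)_{h>0}$ is an approximation of the identity, so that
$K_h * \alpha_0 \to \alpha_0$ as $h \to 0$ under mild conditions (see
Bochner Lemma and Assumption 3.5.(iii)).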
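
To make the construction concrete, the following Python sketch computes a kernel estimator of the baseline function by smoothing the jumps of a Breslow-type estimator. It is only a sketch under stated assumptions: $S(u,\beta_0)$ is replaced by the empirical mean $S_n(u,\beta) = \frac1n\sum_i e^{\beta^T Z_i} Y_i(u)$, the kernel is taken to be the Epanechnikov kernel (the kernel used in the paper is not specified in this section), and the function names are hypothetical.

```python
import numpy as np

def epanechnikov(x):
    """Epanechnikov kernel, supported on [-1, 1] (an assumed choice of K)."""
    return 0.75 * np.clip(1.0 - x ** 2, 0.0, None)

def kernel_baseline_hazard(t_grid, time, event, z, beta, h):
    """Evaluate (1/(n h)) * sum_i delta_i K((t - T_i)/h) / S_n(T_i, beta) on t_grid."""
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=float)
    n = len(time)
    risk = np.exp(z @ beta)                          # e^{beta^T Z_i}
    # at_risk[j, i] = 1 if subject i is still at risk at time T_j.
    at_risk = time[None, :] >= time[:, None]
    s_n = (at_risk * risk[None, :]).mean(axis=1)     # S_n(T_j, beta) > 0
    jumps = event / (n * s_n)                        # Breslow jump sizes
    weights = epanechnikov((t_grid[:, None] - time[None, :]) / h) / h
    return weights @ jumps
```

For instance, with unit-rate exponential survival times, no censoring and $\beta = 0$, the estimate at an interior point should be close to the constant hazard 1, up to sampling noise.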

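In the following, we define for all $t \in [0,\tau]$

$$\alpha_h(t) := \mathbb{E}[\bar\alpha_h(t)] = K_h * \alpha_0(t). \tag{20}$$

The proof is based on the following decomposition for h > 0:

$$\mathbb{E}\big[\|\hat\alpha_h^{\hat\beta} - \alpha_0\|_2^2\big] \le 2\,\mathbb{E}\big[\|\hat\alpha_h^{\hat\beta} - \bar\alpha_h\|_2^2\big] + 2\,\mathbb{E}\big[\|\bar\alpha_h - \alpha_0\|_2^2\big]. \tag{21}$$

Since the pseudo-estimator (19) does not depend on the estimator $\hat\beta$, the error $\mathbb{E}[\|\bar\alpha_h - \alpha_0\|_2^2]$
is easier to bound directly than the error $\mathbb{E}[\|\hat\alpha_h^{\hat\beta} - \alpha_0\|_2^2]$. The study of the error of
$\hat\alpha_h^{\hat\beta} - \alpha_0$ is thus divided into two parts: the error of $\bar\alpha_h - \alpha_0$ and that of $\hat\alpha_h^{\hat\beta} - \bar\alpha_h$.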


The following lemma provides the classical bias/variance inequality for the
pseudo-estimator (19).

Lemma 6.2. Under Assumptions 3.4.(ii)-(iii) and 3.5.(i)-(iii), for h > 0 fixed,

$$\mathbb{E}\big[\|\bar\alpha_h - \alpha_0\|_2^2\big] \le \|\alpha_h - \alpha_0\|_2^2 + \frac{2\|\alpha_0\|_{\infty,\tau}\,\tau}{c_S^2}\Big(\mathbb{E}[e^{\beta_0^T Z_1}] + \|\alpha_0\|_{\infty,\tau}\,\tau\,\mathbb{E}[e^{2\beta_0^T Z_1}]\Big)\frac{\|K\|_{L^2(\mathbb{R})}^2}{nh}. \tag{22}$$


The next lemma controls the quadratic error between $\hat\alpha_h^{\hat\beta}$ and $\bar\alpha_h$. The quantity to be
controlled in this difference is in fact the difference between the regression parameter $\beta_0$
and its Lasso estimator $\hat\beta$. By Proposition 3.3, the $\ell_1$-norm of this difference is bounded
by a term of order $\log(np)/n$. This explains the bound obtained in the following lemma.

Lemma 6.3. Under Assumptions 3.4.(ii)-(iii), 3.5.(i)-(iii), 3.1 and 3.2, for a fixed h > 0,

$$\mathbb{E}\big[\|\hat\alpha_h^{\hat\beta} - \bar\alpha_h\|_2^2\big] \le c(s)\frac{\log(n^k p)}{n},$$

where $c(s)$ is a constant depending on $B$, $|\beta_0|_1$, $R$, $\|\alpha_0\|_{\infty,\tau}$, $c_S$, $\tau$, $\|K\|_{L^2(\mathbb{R})}$ and $s$, the
sparsity index of $\beta_0$.

Combining Lemmas 6.2 and 6.3 in Equation (21) directly yields Lemma 6.1.
Lemmas 6.2 and 6.3 are proved in the supplementary material.


6.2 Proof of the oracle inequality in Theorem 4.1 

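For all $h \in \mathcal{H}_n$, $A^{\hat\beta}(h)$ is defined by (9), and we can apply this definition with $h = \hat h^{\hat\beta}$. We
deduce from this, using Definition (12) of $\hat h^{\hat\beta}$, that for all $h \in \mathcal{H}_n$

$$\begin{aligned}
\|\hat\alpha^{\hat\beta}_{\hat h^{\hat\beta}} - \alpha_0\|_2^2
&\le 3\|\hat\alpha^{\hat\beta}_{\hat h^{\hat\beta}} - \hat\alpha^{\hat\beta}_{h,\hat h^{\hat\beta}}\|_2^2 + 3\|\hat\alpha^{\hat\beta}_{h,\hat h^{\hat\beta}} - \hat\alpha^{\hat\beta}_{h}\|_2^2 + 3\|\hat\alpha^{\hat\beta}_{h} - \alpha_0\|_2^2 \\
&\le 3\big(A^{\hat\beta}(h) + V(\hat h^{\hat\beta})\big) + 3\big(A^{\hat\beta}(\hat h^{\hat\beta}) + V(h)\big) + 3\|\hat\alpha^{\hat\beta}_{h} - \alpha_0\|_2^2 \\
&\le 6\big(A^{\hat\beta}(h) + V(h)\big) + 3\|\hat\alpha^{\hat\beta}_{h} - \alpha_0\|_2^2.
\end{aligned}$$

We obtain for $h \in \mathcal{H}_n$

$$\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{\hat h^{\hat\beta}} - \alpha_0\|_2^2\big] \le 6\,\mathbb{E}[A^{\hat\beta}(h)] + 6V(h) + 3\,\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_{h} - \alpha_0\|_2^2\big]. \tag{23}$$

Lemma 6.1 gives a bound on $\mathbb{E}[\|\hat\alpha^{\hat\beta}_{h} - \alpha_0\|_2^2]$, which reveals the bias term, the variance
term of order $1/(nh)$ and a remaining term of order $\log(np)/n$, and $V(h)$ is of the expected
order $1/(nh)$. $\mathbb{E}[A^{\hat\beta}(h)]$ is bounded in the following proposition.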


Proposition 6.4. Let $h \in \mathcal{H}_n$ be fixed. Under the assumptions of Theorem 4.1, there
exist constants $C_1$, $C_2(s)$, $C_3(s)$ such that

$$\mathbb{E}[A^{\hat\beta}(h)] \le C_1\|\alpha_h - \alpha_0\|_2^2 + C_2(s)\frac{\log^a(n)\log(n^k p)}{n} + C_3(s)\frac{\log(n^k p)}{n}, \tag{24}$$

where the constant $C_1$ only depends on $\|K\|_{L^1(\mathbb{R})}$.

Applying Inequalities (18) and (24) in Equation (23) implies Inequality (16) by taking
the infimum over $h \in \mathcal{H}_n$. This ends the proof of Theorem 4.1. □
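
The selection rule analysed in this proof (compute $A(h)$ by comparing doubly smoothed and singly smoothed estimators, then minimise $A(h) + V(h)$ over the bandwidth grid) can be sketched in Python as follows. This is a simplified illustration, not the paper's procedure: it assumes a Gaussian kernel, for which $K_{h'} * K_h$ is again Gaussian with bandwidth $\sqrt{h^2 + h'^2}$ when the convolution is taken over the whole real line, it takes $V(h) = \kappa/(nh)$ with a hypothetical tuning constant `kappa`, and it approximates $L^2$ norms on a grid.

```python
import numpy as np

def gaussian_kernel_estimate(t_grid, jump_times, jump_sizes, h):
    """sum_i w_i K_h(t - T_i) for a Gaussian kernel K (assumed kernel)."""
    x = (t_grid[:, None] - jump_times[None, :]) / h
    return (np.exp(-0.5 * x ** 2) / (h * np.sqrt(2.0 * np.pi))) @ jump_sizes

def sq_norm(t_grid, f):
    """Trapezoidal approximation of the squared L2 norm of f on t_grid."""
    return float(np.sum(0.5 * (f[1:] ** 2 + f[:-1] ** 2) * np.diff(t_grid)))

def gl_select(t_grid, jump_times, jump_sizes, bandwidths, n, kappa=1.0):
    """Goldenshluger-Lepski type choice: minimise A(h) + V(h) over the grid."""
    V = {h: kappa / (n * h) for h in bandwidths}      # assumed form of V(h)
    est = {h: gaussian_kernel_estimate(t_grid, jump_times, jump_sizes, h)
           for h in bandwidths}
    A = {}
    for h in bandwidths:
        terms = []
        for hp in bandwidths:
            # K_{h'} * alpha_hat_h: Gaussian estimate with bandwidth sqrt(h^2 + h'^2).
            double = gaussian_kernel_estimate(t_grid, jump_times, jump_sizes,
                                              float(np.hypot(h, hp)))
            terms.append(max(sq_norm(t_grid, double - est[hp]) - V[hp], 0.0))
        A[h] = max(terms)
    h_hat = min(bandwidths, key=lambda h: A[h] + V[h])
    return h_hat, A, V
```

Here `jump_times` and `jump_sizes` stand for the event times and the corresponding Breslow-type jump sizes; in practice the constant in $V(h)$ would have to be calibrated, which this sketch does not address.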
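6.3 Proof of Proposition 6.4

We introduce several additional notations, $\bar\alpha_{h,h'} = K_{h'} * \bar\alpha_h$, $\alpha_h(t) = \mathbb{E}[\bar\alpha_h(t)]$ and
$\alpha_{h,h'}(t) = \mathbb{E}[\bar\alpha_{h,h'}(t)]$, and write

$$\begin{aligned}
A^{\hat\beta}(h) &= \sup_{h' \in \mathcal{H}_n}\Big\{\|\hat\alpha^{\hat\beta}_{h,h'} - \hat\alpha^{\hat\beta}_{h'}\|_2^2 - V(h')\Big\}_+ \\
&\le 5\sup_{h' \in \mathcal{H}_n}\Big\{\|\bar\alpha_{h'} - \alpha_{h'}\|_2^2 - \frac{V(h')}{10}\Big\}_+ + 5\sup_{h' \in \mathcal{H}_n}\Big\{\|\bar\alpha_{h,h'} - \alpha_{h,h'}\|_2^2 - \frac{V(h')}{10}\Big\}_+ \\
&\quad + 5\sup_{h' \in \mathcal{H}_n}\|\hat\alpha^{\hat\beta}_{h'} - \bar\alpha_{h'}\|_2^2 + 5\sup_{h' \in \mathcal{H}_n}\|\hat\alpha^{\hat\beta}_{h,h'} - \bar\alpha_{h,h'}\|_2^2 + 5\sup_{h' \in \mathcal{H}_n}\|\alpha_{h'} - \alpha_{h,h'}\|_2^2 \\
&:= 5(T_1 + T_2 + T_3 + T_4 + T_5).
\end{aligned}$$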
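
The factor 5 and the thresholds $V(h')/10$ in this decomposition can be traced back to an elementary inequality. The following worked step is a reading aid, writing $\bar\alpha$ for the pseudo-estimator and $\alpha_h$, $\alpha_{h,h'}$ for the expectations (notation assumed from the definitions above); it is not quoted from the paper.

```latex
% Telescoping the difference into five pieces:
\hat\alpha^{\hat\beta}_{h,h'} - \hat\alpha^{\hat\beta}_{h'}
  = (\hat\alpha^{\hat\beta}_{h,h'} - \bar\alpha_{h,h'})
  + (\bar\alpha_{h,h'} - \alpha_{h,h'})
  + (\alpha_{h,h'} - \alpha_{h'})
  + (\alpha_{h'} - \bar\alpha_{h'})
  + (\bar\alpha_{h'} - \hat\alpha^{\hat\beta}_{h'}).

% Cauchy-Schwarz on five terms gives
\Big\| \sum_{j=1}^{5} a_j \Big\|_2^2 \le 5 \sum_{j=1}^{5} \|a_j\|_2^2 ,

% and splitting V(h') = 5\,(V(h')/10 + V(h')/10) between the two centered terms,
\Big\{ 5 \sum_{j=1}^{5} \|a_j\|_2^2 - V(h') \Big\}_+
  \le 5\Big\{ \|a_2\|_2^2 - \tfrac{V(h')}{10} \Big\}_+
    + 5\Big\{ \|a_4\|_2^2 - \tfrac{V(h')}{10} \Big\}_+
    + 5\big( \|a_1\|_2^2 + \|a_3\|_2^2 + \|a_5\|_2^2 \big).
```

Taking the supremum over $h'$ term by term then gives the five quantities $T_1, \dots, T_5$.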


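• Study of $\mathbb{E}[T_1]$: Recall that for all $h \in \mathcal{H}_n$

$$\|\bar\alpha_h - \alpha_h\|_2 = \sup_{f \in L^2([0,\tau]),\ \|f\|_2 = 1} \langle \bar\alpha_h - \alpha_h, f\rangle_2. \tag{25}$$

We introduce the centered empirical process $\nu_{n,h}(f) = \langle \bar\alpha_h - \alpha_h, f\rangle_2$, which is equal to

$$\frac{1}{n}\sum_{i=1}^n \int_0^\tau f(t)\Big(\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\big(dN_i(u) - \alpha_0(u)S(u,\beta_0)\,du\big)\Big)\,dt.$$

As $f \mapsto \nu_{n,h}(f)$ is continuous, the supremum in (25) can be taken over a countable dense
subset of $\{f \in L^2([0,\tau]),\ \|f\|_2 = 1\}$, which we denote by $B_\tau$. Therefore, we can write

$$\mathbb{E}[T_1] \le \mathbb{E}\Big[\sup_{h' \in \mathcal{H}_n}\Big\{\|\bar\alpha_{h'} - \alpha_{h'}\|_2^2 - \frac{V(h')}{10}\Big\}_+\Big] \le \sum_{h' \in \mathcal{H}_n}\mathbb{E}\Big[\Big\{\|\bar\alpha_{h'} - \alpha_{h'}\|_2^2 - \frac{V(h')}{10}\Big\}_+\Big] \le \sum_{h' \in \mathcal{H}_n}\mathbb{E}\Big[\Big\{\sup_{f \in B_\tau}\nu_{n,h'}^2(f) - \frac{V(h')}{10}\Big\}_+\Big]. \tag{26}$$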

Let us introduce a key lemma, which allows us to bound (26).

Lemma 6.5. Let us introduce the centered process $\nu_{n,h}(f) = \langle \bar\alpha_h - \alpha_h, f\rangle_2$, for any
$h \in \mathcal{H}_n$ and $f \in L^2([0,\tau])$, and $B_\tau = \{f \in L^2([0,\tau]),\ \|f\|_2 = 1\}$. Under the assumptions
of Theorem 4.1, with $V(h')$ defined by (11) for any $h' \in \mathcal{H}_n$, there exist two constants $c_6$
and $c_7$ depending on the bound $\kappa_b$ of the Burkholder Inequality, $\tau$, $\|\alpha_0\|_{\infty,\tau}$, the bound $c_S$
of $S(t,\beta_0)$, $\mathbb{E}[e^{\beta_0^T Z_1}]$, $\mathbb{E}[e^{2\beta_0^T Z_1}]$, $\|K\|_{L^1(\mathbb{R})}$ and $\|K\|_{L^2(\mathbb{R})}$, such that

$$\sum_{h \in \mathcal{H}_n}\mathbb{E}\Big[\Big\{\sup_{f \in B_\tau}\nu_{n,h}^2(f) - \frac{V(h)}{10}\Big\}_+\Big] \le \frac{c_6}{n} + c_7\frac{\log^a(n)}{n}.$$
n 
So, from Lemma 6.5, there exist two constants $c_6 > 0$ and $c_7 > 0$ such that

$$\mathbb{E}[T_1] \le \frac{c_6}{n} + c_7\frac{\log^a(n)}{n}, \tag{27}$$

where $c_6$ and $c_7$ depend on $\tau$, $\|\alpha_0\|_{\infty,\tau}$, $c_S$, $\mathbb{E}[e^{\beta_0^T Z_1}]$, $\mathbb{E}[e^{2\beta_0^T Z_1}]$, $\|K\|_{L^1(\mathbb{R})}$ and $\|K\|_{L^2(\mathbb{R})}$.

• Study of $\mathbb{E}[T_2]$: We study $T_2$ similarly to $T_1$ since

$$\mathbb{E}[T_2] \le \sum_{h' \in \mathcal{H}_n}\mathbb{E}\Big[\Big\{\|\bar\alpha_{h,h'} - \alpha_{h,h'}\|_2^2 - \frac{V(h')}{10}\Big\}_+\Big].$$

From Lemma 6.5 (see the remark at the end of the proof of Lemma 6.5), there exist two
constants $c_8 > 0$ and $c_9 > 0$ such that

$$\mathbb{E}[T_2] \le \frac{c_8}{n} + c_9\frac{\log^a(n)}{n}, \tag{28}$$

where $c_8$ and $c_9$ depend on $\tau$, $\|\alpha_0\|_{\infty,\tau}$, $c_S$, $\mathbb{E}[e^{\beta_0^T Z_1}]$, $\mathbb{E}[e^{2\beta_0^T Z_1}]$, $\|K\|_{L^1(\mathbb{R})}$ and $\|K\|_{L^2(\mathbb{R})}$.


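• Study of $\mathbb{E}[T_3]$: First, write for all $h \in \mathcal{H}_n$ that

$$\hat\alpha_h^{\hat\beta}(t) - \bar\alpha_h(t) = \frac{1}{nh}\sum_{i=1}^n \int_0^\tau K\Big(\frac{t-u}{h}\Big)\frac{S(u,\beta_0)\mathbb{1}_{\{\bar Y(u) > 0\}} - S_n(u,\hat\beta)}{S_n(u,\hat\beta)\,S(u,\beta_0)}\,dN_i(u).$$

For all $u \in [0,\tau]$, we have $S(u,\beta_0)\mathbb{1}_{\{\bar Y(u) > 0\}} - S_n(u,\hat\beta) = \big(S(u,\beta_0) - S_n(u,\hat\beta)\big)\mathbb{1}_{\{\bar Y(u) > 0\}}$.
Indeed, for all $u \in [0,\tau]$, if $\mathbb{1}_{\{\bar Y(u) > 0\}} = 0$, then for all $i \in \{1, \dots, n\}$, $Y_i(u) = 0$ and
$S_n(u,\hat\beta) = 0$. So, we can rewrite for all $h \in \mathcal{H}_n$ that

$$\hat\alpha_h^{\hat\beta}(t) - \bar\alpha_h(t) = \frac{1}{n}\sum_{i=1}^n \int_0^\tau K_h(t-u)\frac{S(u,\beta_0) - S_n(u,\hat\beta)}{S_n(u,\hat\beta)\,S(u,\beta_0)}\mathbb{1}_{\{\bar Y(u) > 0\}}\,dN_i(u). \tag{29}$$

Consider the following sets:

$$\Omega_{H,k} = \Big\{\omega : \forall u \in [0,\tau],\ |S_n(u,\hat\beta) - S(u,\beta_0)| \le 2C(s)Be^{BR}e^{2B|\beta_0|_1}\sqrt{\frac{\log(pn^k)}{n}}\Big\}, \tag{30}$$

$$\Omega_{S_n} = \Big\{\omega : \forall u \in [0,\tau],\ S_n(u,\hat\beta) - S(u,\beta_0) \ge -\frac{c_S}{2}\Big\}, \tag{31}$$

$$\Omega_k = \Omega_{H,k} \cap \Omega_{S_n}. \tag{32}$$

We decompose $T_3$ on $\Omega_k$ and on its complement. On $\Omega_k^c$, let us introduce the following
lemma: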


Lemma 6.6. Under Assumptions 3.4.(ii)-(iii), 3.5.(i)-(iii), 3.1 and 3.2, for all $k \in \mathbb{N}$,

7 np hnvp 

E[pf-d,|| 2 l(^)]<c 3 n 4 -^ ; 

where $c_3$ is a constant depending on $B$, $|\beta_0|_1$, $R$, $\|\alpha_0\|_{\infty,\tau}$, $c_S$, $\tau$ and $\|K\|_\infty$. Choosing $k > 10$
yields $\mathbb{E}[\|\hat\alpha_h^{\hat\beta} - \bar\alpha_h\|_2^2] \le c_3/n$.
From Lemma 6.6,


E 


sup ||dj - a h r\\mn c k ) 

h'&Hn 


< £ E[||fif,-a*||ijl(n£)] 

h'eHu 

< e 

h’&Hn 


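which is of order $1/n$ as long as $k > 12$. On the other hand, from (29), on $\Omega_k$ we have

$$\mathbb{E}\Big[\sup_{h' \in \mathcal{H}_n}\int_0^\tau \big(\hat\alpha^{\hat\beta}_{h'} - \bar\alpha_{h'}\big)^2(t)\,\mathbb{1}(\Omega_k)\,dt\Big] \le \frac{16\,C(s)^2B^2e^{2BR}e^{4B|\beta_0|_1}}{c_S^2}\,\frac{\log(pn^k)}{n}\;\mathbb{E}\Big[\sup_{h' \in \mathcal{H}_n}\int_0^\tau\Big(\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\,\frac{1}{n}\sum_{i=1}^n dN_i(u)\Big)^2 dt\Big].$$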
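Then, we center $\frac{1}{n}\sum_{i=1}^n dN_i(u)$ by its expectation $\alpha_0(u)S(u,\beta_0)\,du$ to obtain

$$\begin{aligned}
\mathbb{E}\Big[\sup_{h' \in \mathcal{H}_n}\int_0^\tau\Big(\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\,\frac{1}{n}\sum_{i=1}^n dN_i(u)\Big)^2 dt\Big]
&\le 2\,\mathbb{E}\Big[\sup_{h' \in \mathcal{H}_n}\int_0^\tau\Big(\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\Big(\frac{1}{n}\sum_{i=1}^n dN_i(u) - \alpha_0(u)S(u,\beta_0)\,du\Big)\Big)^2 dt\Big] \qquad (33) \\
&\quad + 2\sup_{h' \in \mathcal{H}_n}\int_0^\tau\Big(\int_0^\tau |K_{h'}(t-u)|\,\alpha_0(u)\,du\Big)^2 dt. \qquad (34)
\end{aligned}$$

The term (34) is bounded by $2\tau\|\alpha_0\|_{\infty,\tau}^2\|K\|_{L^1(\mathbb{R})}^2$. Let us bound the term (33):

$$\mathbb{E}\Big[\sup_{h' \in \mathcal{H}_n}\int_0^\tau\Big(\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\Big(\frac{1}{n}\sum_{i=1}^n dN_i(u) - \alpha_0(u)S(u,\beta_0)\,du\Big)\Big)^2 dt\Big] \le \sum_{h' \in \mathcal{H}_n}\int_0^\tau \mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^n\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\,dN_i(u)\Big)\,dt.$$

It remains to bound the variance term:

$$\mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^n\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\,dN_i(u)\Big) \le \frac{1}{n}\,\mathbb{E}\Big[\Big(\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\,dN_1(u)\Big)^2\Big].$$

We apply the Doob-Meyer decomposition to get

$$\begin{aligned}
\mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^n\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\,dN_i(u)\Big)
&\le \frac{2}{n}\,\mathbb{E}\Big[\Big(\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\,dM_1(u)\Big)^2\Big] \qquad (35) \\
&\quad + \frac{2}{n}\,\mathbb{E}\Big[\Big(\int_0^\tau \frac{|K_{h'}(t-u)|}{S(u,\beta_0)}\,\alpha_0(u)e^{\beta_0^T Z_1}Y_1(u)\,du\Big)^2\Big]. \qquad (36)
\end{aligned}$$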
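The term (35) is bounded by

$$\frac{2}{n}\,\mathbb{E}\Big[\int_0^\tau \frac{K_{h'}^2(t-u)}{(S(u,\beta_0))^2}\,\alpha_0(u)e^{\beta_0^T Z_1}Y_1(u)\,du\Big] \le \frac{2\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{\beta_0^T Z_1}]\,\|K\|_{L^2(\mathbb{R})}^2}{n\,c_S^2\,h'}, \tag{37}$$

and from the Cauchy-Schwarz inequality, the term (36) is bounded by

$$\frac{2\|\alpha_0\|_{\infty,\tau}^2\,\mathbb{E}[e^{2\beta_0^T Z_1}]\,\tau\,\|K\|_{L^2(\mathbb{R})}^2}{n\,c_S^2\,h'}. \tag{38}$$

From (37) and (38), (33) is bounded by

$$\frac{4\|\alpha_0\|_{\infty,\tau}\,\tau}{n\,c_S^2}\Big(\mathbb{E}[e^{\beta_0^T Z_1}] + \|\alpha_0\|_{\infty,\tau}\,\tau\,\mathbb{E}[e^{2\beta_0^T Z_1}]\Big)\|K\|_{L^2(\mathbb{R})}^2\sum_{h' \in \mathcal{H}_n}\frac{1}{h'}. \tag{39}$$

From Condition 3.5.(ii) and bounds (34) and (39), we deduce that

$$\mathbb{E}\Big[\sup_{h' \in \mathcal{H}_n}\int_0^\tau \big(\hat\alpha^{\hat\beta}_{h'} - \bar\alpha_{h'}\big)^2(t)\,\mathbb{1}(\Omega_k)\,dt\Big] \le C\,\frac{\log^a(n)\log(pn^k)}{n},$$

where $C$ depends on $s$, $c_S$, $B$, $R$, $|\beta_0|_1$, $\|\alpha_0\|_{\infty,\tau}$, $\tau$, $\|K\|_{L^2(\mathbb{R})}$, $\mathbb{E}[e^{\beta_0^T Z_1}]$ and $\mathbb{E}[e^{2\beta_0^T Z_1}]$.

Finally, there exists a constant $c_5 > 0$ such that

$$\mathbb{E}[T_3] \le c_5\frac{\log^a(n)\log(n^k p)}{n}, \tag{40}$$

where $c_5$ depends on $s$, $c_S$, $B$, $R$, $\tau$, $\|\alpha_0\|_{\infty,\tau}$, $|\beta_0|_1$, $\|K\|_{L^2(\mathbb{R})}$, $\mathbb{E}[e^{\beta_0^T Z_1}]$ and $\mathbb{E}[e^{2\beta_0^T Z_1}]$.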


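• Study of $\mathbb{E}[T_4]$: Since

$$\hat\alpha^{\hat\beta}_{h,h'} - \bar\alpha_{h,h'} = K_{h'} * (\hat\alpha^{\hat\beta}_h - \bar\alpha_h),$$

we have from Young Inequality (Lemma 2.2 in the supplementary material) with $p = 1$, $q = 2$, $r = 2$,

$$\mathbb{E}[T_4] \le \|K\|_{L^1(\mathbb{R})}^2\,\mathbb{E}\big[\|\hat\alpha^{\hat\beta}_h - \bar\alpha_h\|_2^2\big] \le c(s)\|K\|_{L^1(\mathbb{R})}^2\frac{\log(n^k p)}{n}, \tag{41}$$

where the last inequality is obtained from Lemma 6.3.

• Study of $\mathbb{E}[T_5]$: From Young Inequality (Lemma 2.2 in the supplementary material)
with $p = 1$, $q = 2$, $r = 2$, we obtain that

$$\|\alpha_{h'} - \alpha_{h,h'}\|_2^2 = \|K_{h'} * (\alpha_0 - K_h * \alpha_0)\|_2^2 \le \|K\|_{L^1(\mathbb{R})}^2\,\|\alpha_0 - K_h * \alpha_0\|_2^2.$$

Therefore, since $\alpha_h = K_h * \alpha_0$,

$$\mathbb{E}[T_5] \le \|K\|_{L^1(\mathbb{R})}^2\,\|\alpha_h - \alpha_0\|_2^2, \tag{42}$$

which corresponds to a bias term.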
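
The Young inequality used for $T_4$ and $T_5$, $\|K * g\|_2 \le \|K\|_{L^1}\|g\|_2$, has an exact discrete analogue that is easy to check numerically. The short Python sketch below is only such a sanity check on sequences; the helper name `young_gap` is hypothetical and the discrete setting is an assumption made for illustration.

```python
import numpy as np

def young_gap(kernel, g):
    """Return (lhs, rhs) of ||kernel * g||_2 <= ||kernel||_1 * ||g||_2.

    Uses the full discrete convolution of two finite sequences.
    """
    conv = np.convolve(kernel, g)
    lhs = np.sqrt(np.sum(conv ** 2))
    rhs = np.sum(np.abs(kernel)) * np.sqrt(np.sum(np.asarray(g) ** 2))
    return float(lhs), float(rhs)
```

With a kernel reduced to a single unit mass the two sides coincide, which mirrors the fact that the inequality cannot be improved in general.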


Finally, gathering the bounds of the five terms (27), (28), (40), (41) and (42), gives 
the result of Proposition 6.4. □ 
6.4 Proof of Lemma 6.5

We have to control the supremum of $\nu_{n,h}(f)$ defined by (43) over the ball $B_\tau = \{g \in L^2([0,\tau]),\ \|g\|_2 = 1\}$.
For all $h \in \mathcal{H}_n$ and $f \in B_\tau$, we have

$$\nu_{n,h}(f) = \frac{1}{n}\sum_{i=1}^n \int_0^\tau f(t)\Big(\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\big(dN_i(u) - \alpha_0(u)S(u,\beta_0)\,du\big)\Big)\,dt. \tag{43}$$

Usually, to control such a process, we apply the Talagrand Inequality given in Theorem
??. However, since $\nu_{n,h}(f)$ is not bounded, we cannot directly apply the Talagrand
Inequality: we have to introduce a truncation (see Chagny (2014) for a close approach).
Let us define, for a constant $c$,

$$\kappa_n = c\,\frac{\sqrt{n}}{\log n},$$

and we decompose $\nu_{n,h}$ as

$$\nu_{n,h}(f) = \nu^{(1)}_{n,h}(f) + \nu^{(2)}_{n,h}(f),$$
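where

$$\nu^{(1)}_{n,h}(f) = \frac{1}{n}\sum_{i=1}^n\Big(\int_0^\tau f(t)\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_i(\tau) \le \kappa_n\}}\,dN_i(u)\,dt - \mathbb{E}\Big[\int_0^\tau f(t)\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_i(\tau) \le \kappa_n\}}\,dN_i(u)\,dt\Big]\Big),$$

and

$$\nu^{(2)}_{n,h}(f) = \frac{1}{n}\sum_{i=1}^n\Big(\int_0^\tau f(t)\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_i(\tau) > \kappa_n\}}\,dN_i(u)\,dt - \mathbb{E}\Big[\int_0^\tau f(t)\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_i(\tau) > \kappa_n\}}\,dN_i(u)\,dt\Big]\Big).$$

• Control of $\nu^{(1)}_{n,h}(f)$:

We can apply a Talagrand Inequality to $\nu^{(1)}_{n,h}(f)$, which is bounded. To apply this
concentration inequality, we need to determine the bounds $H$, $M$, $W$ and the
constant $\varepsilon$ (see Theorem ?? in Appendix ?? for the notations).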


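- Determination of the constant M:
Using the Cauchy-Schwarz Inequality, we have for $f \in B_\tau$

$$\Big|\int_0^\tau f(t)\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_1(\tau) \le \kappa_n\}}\,dN_1(u)\,dt\Big| \le \|f\|_2\,\frac{\|K\|_{L^2(\mathbb{R})}}{c_S\sqrt{h}}\,\big|N_1(\tau)\mathbb{1}_{\{N_1(\tau) \le \kappa_n\}}\big| \le \frac{\|K\|_{L^2(\mathbb{R})}\,\kappa_n}{c_S\sqrt{h}} =: M,$$

so that $M$ is of order $\sqrt{n}/(\log n\,\sqrt{h})$.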


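- Determination of the constant H:
Let us define

$$\psi_h(t) = \frac{1}{n}\sum_{i=1}^n \int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_i(\tau) \le \kappa_n\}}\,dN_i(u).$$

We have $\sup_{f \in B_\tau}(\nu^{(1)}_{n,h}(f))^2 = \sup_{f \in B_\tau}\langle \psi_h - \mathbb{E}[\psi_h], f\rangle_2^2 = \|\psi_h - \mathbb{E}[\psi_h]\|_2^2$. We
deduce from the Doob-Meyer decomposition that

$$\begin{aligned}
\mathbb{E}\Big[\sup_{f \in B_\tau}\big(\nu^{(1)}_{n,h}(f)\big)^2\Big] &= \int_0^\tau \mathrm{Var}[\psi_h(t)]\,dt \\
&\le \frac{1}{n}\int_0^\tau \mathbb{E}\Big[\Big(\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_1(\tau) \le \kappa_n\}}\,dN_1(u)\Big)^2\Big]\,dt \\
&\le \frac{2\|\alpha_0\|_{\infty,\tau}\,\tau}{c_S^2}\Big(\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{2\beta_0^T Z_1}]\,\tau + \mathbb{E}[e^{\beta_0^T Z_1}]\Big)\frac{\|K\|_{L^2(\mathbb{R})}^2}{nh} =: H^2.
\end{aligned}$$

We have $H^2 = V(h)/\kappa$. Then, we set $\varepsilon^2 = 1/2$ and $\kappa = 80$ in order to have
$2(1 + 2\varepsilon^2)H^2 = V(h)/20 = O(1/(nh))$.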
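
For the reader's convenience, the arithmetic behind this choice of constants can be written out as follows (a worked check using only the relation $H^2 = V(h)/\kappa$ stated above):

```latex
\varepsilon^2 = \tfrac12,\ \kappa = 80
\;\Longrightarrow\;
2\,(1 + 2\varepsilon^2)\,H^2
  = 2\,(1 + 1)\,\frac{V(h)}{80}
  = \frac{4\,V(h)}{80}
  = \frac{V(h)}{20}.
```

Since $V(h)$ is proportional to $1/(nh)$, this quantity is indeed of order $1/(nh)$.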

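- Determination of the constant W:

Since $f \in B_\tau$, we have

$$\begin{aligned}
\mathrm{Var}\Big(\int_0^\tau f(t)\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_1(\tau) \le \kappa_n\}}\,dN_1(u)\,dt\Big)
&\le \mathbb{E}\Big[\Big(\int_0^\tau f(t)\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_1(\tau) \le \kappa_n\}}\,dN_1(u)\,dt\Big)^2\Big] \\
&\le \mathbb{E}\Big[\Big(\int_0^\tau \frac{(K_h * f)(u)}{S(u,\beta_0)}\,dN_1(u)\Big)^2\Big].
\end{aligned}$$

So, from the Doob-Meyer decomposition and Young Inequality (Lemma 2.2 in the
supplementary material), we have

$$\begin{aligned}
\mathrm{Var}\Big(\int_0^\tau f(t)\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_1(\tau) \le \kappa_n\}}\,dN_1(u)\,dt\Big)
&\le \frac{2\|\alpha_0\|_{\infty,\tau}}{c_S^2}\Big(\tau\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{2\beta_0^T Z_1}] + \mathbb{E}[e^{\beta_0^T Z_1}]\Big)\|K_h * f\|_2^2 \\
&\le \frac{2\|\alpha_0\|_{\infty,\tau}}{c_S^2}\Big(\tau\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{2\beta_0^T Z_1}] + \mathbb{E}[e^{\beta_0^T Z_1}]\Big)\|K\|_{L^1(\mathbb{R})}^2 =: W.
\end{aligned}$$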

Then, from Assumptions 3.7.(ii) and (iii), we deduce that 


E E 

heHn 


sup(d!i(/)) 2 -nfc)/20 


V 1 


<E E (e-- + 


n 


hen,, 


nlog ( n)h 


,-tf3 log” 


<h+tf 2 i^ 

n n ” 3 


(44) 


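with

$$V(h) = \kappa\,\frac{2\|\alpha_0\|_{\infty,\tau}\,\tau}{c_S^2}\Big(\|\alpha_0\|_{\infty,\tau}\,\mathbb{E}[e^{2\beta_0^T Z_1}]\,\tau + \mathbb{E}[e^{\beta_0^T Z_1}]\Big)\frac{\|K\|_{L^2(\mathbb{R})}^2}{nh}.$$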

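• Control of $\nu^{(2)}_{n,h}(f)$:

Now, let us focus on the second, unbounded term $\nu^{(2)}_{n,h}(f)$. Let us consider the process
$\Upsilon(t)$ defined as

$$\Upsilon(t) = \frac{1}{n}\sum_{i=1}^n\Big(\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_i(\tau) > \kappa_n\}}\,dN_i(u) - \mathbb{E}\Big[\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_i(\tau) > \kappa_n\}}\,dN_i(u)\Big]\Big),$$

so that $\nu^{(2)}_{n,h}(f) = \int_0^\tau f(t)\Upsilon(t)\,dt$. Using the Cauchy-Schwarz inequality, we get

$$\begin{aligned}
\mathbb{E}\Big[\sup_{f \in B_\tau}\big(\nu^{(2)}_{n,h}(f)\big)^2\Big] &\le \mathbb{E}\Big[\int_0^\tau \Upsilon^2(t)\,dt\Big] \\
&\le \frac{1}{n}\int_0^\tau \mathrm{Var}\Big(\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_1(\tau) > \kappa_n\}}\,dN_1(u)\Big)\,dt \\
&\le \frac{1}{n}\int_0^\tau \mathbb{E}\Big[\Big(\int_0^\tau \frac{K_h(t-u)}{S(u,\beta_0)}\mathbb{1}_{\{N_1(\tau) > \kappa_n\}}\,dN_1(u)\Big)^2\Big]\,dt.
\end{aligned}$$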

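Applying the Cauchy-Schwarz Inequality (see Lemma 2.1 in the supplementary
material), we obtain that for all $k > 0$,

$$\begin{aligned}
\mathbb{E}\Big[\sup_{f \in B_\tau}\big(\nu^{(2)}_{n,h}(f)\big)^2\Big]
&\le \frac{1}{n}\int_0^\tau \mathbb{E}\Big[\mathbb{1}_{\{N_1(\tau) > \kappa_n\}}\,N_1(\tau)\int_0^\tau \frac{K_h^2(t-u)}{S^2(u,\beta_0)}\,dN_1(u)\Big]\,dt \\
&\le \frac{\|K\|_{L^2(\mathbb{R})}^2}{nh\,c_S^2}\,\mathbb{E}\big[N_1^2(\tau)\mathbb{1}_{\{N_1(\tau) > \kappa_n\}}\big] \\
&\le \frac{\|K\|_{L^2(\mathbb{R})}^2}{nh\,c_S^2\,\kappa_n^k}\,\mathbb{E}\big[N_1^{k+2}(\tau)\mathbb{1}_{\{N_1(\tau) > \kappa_n\}}\big] \\
&\le \frac{\|K\|_{L^2(\mathbb{R})}^2}{nh\,c_S^2\,\kappa_n^k}\,\mathbb{E}\big[N_1^{k+2}(\tau)\big].
\end{aligned}$$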


From Assumption 3.7.(ii), we deduce that for $k$ large enough

$$\sum_{h \in \mathcal{H}_n}\mathbb{E}\Big[\sup_{f \in B_\tau}\big(\nu^{(2)}_{n,h}(f)\big)^2\Big] \le C\,\frac{\log^a(n)\,\mathbb{E}[N_1(\tau)^{k+1}]}{n}.$$

It remains to verify that $\mathbb{E}[N_1(\tau)^{k+1}]$ is bounded. Using the fact that for all $a > 0$,
$b > 0$ and $k \ge 1$, $(a + b)^k \le 2^{k-1}(a^k + b^k)$, and from the Burkholder Inequality,
we can easily show by induction that for all $k \in \mathbb{N}^*$, $\mathbb{E}[N_1(\tau)^k] \le C_k$. Thus, we
conclude that for a good choice of $k$,

$$\sum_{h \in \mathcal{H}_n}\mathbb{E}\Big[\sup_{f \in B_\tau}\big(\nu^{(2)}_{n,h}(f)\big)^2\Big] \le C\,\frac{\log^a(n)}{n}, \tag{45}$$

for a constant $C > 0$.

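Combining (44) and (45), we finally get

$$\sum_{h \in \mathcal{H}_n}\mathbb{E}\Big[\Big\{\sup_{f \in B_\tau}\nu_{n,h}^2(f) - \frac{V(h)}{10}\Big\}_+\Big] \le \frac{c_6}{n} + c_7\frac{\log^a(n)}{n},$$

where $c_6$ and $c_7$ depend on $\tau$, $\|\alpha_0\|_{\infty,\tau}$, $c_S$, $\mathbb{E}[e^{\beta_0^T Z_1}]$, $\mathbb{E}[e^{2\beta_0^T Z_1}]$, $\|K\|_{L^1(\mathbb{R})}$ and $\|K\|_{L^2(\mathbb{R})}$.

□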


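Remark: A similar lemma can be obtained for the centered process $\langle \bar\alpha_{h,h'} - \alpha_{h,h'}, f\rangle_2$,
where $\bar\alpha_{h,h'} = K_{h'} * \bar\alpha_h$ and $\alpha_{h,h'} = \mathbb{E}[\bar\alpha_{h,h'}]$ for $h, h' \in \mathcal{H}_n$. Indeed, from Young Inequality
(Lemma 2.2 in the supplementary material), we have

$$\begin{aligned}
\langle \bar\alpha_{h,h'} - \alpha_{h,h'}, f\rangle_2 &= \int_0^\tau f(t)\big(K_{h'} * \bar\alpha_h(t) - \mathbb{E}[K_{h'} * \bar\alpha_h(t)]\big)\,dt \\
&\le \|f\|_2\,\|K_{h'} * (\bar\alpha_h - \mathbb{E}[\bar\alpha_h])\|_2 \\
&\le \|f\|_2\,\|K\|_{L^1(\mathbb{R})}\,\|\bar\alpha_h - \mathbb{E}[\bar\alpha_h]\|_2.
\end{aligned}$$

Just take the same constants $M$, $H^2$ and $W$ as previously and multiply them by
$\|K\|_{L^1(\mathbb{R})}$.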
The proofs of Lemmas 6.2, 6.3 and 6.6 are available in the supplementary material.